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A  THEORETICAL  STUDY  OF  TWO-STAGE  TESTING 
Frederic  M.  Lord 
Educational  Testing  Service 

ABSTRACT

When items cannot be answered correctly by guessing, certain two-stage
testing procedures are about as effective over the ability range of
interest as the "best" up-and-down procedures studied previously.
When answers can be guessed correctly 20 percent of the time, no
two-stage procedure is found to match the "best" up-and-down procedures
over this ability range. Feet-on-the-desk designs for two-stage
procedures may produce poor results.


A THEORETICAL STUDY OF TWO-STAGE TESTING¹
Frederic M. Lord
Educational  Testing  Service 

A two-stage testing procedure consists of a routing test followed
by one of several alternative second-stage tests. All tests are of
conventional type. The choice of the second-stage test administered
is determined by the examinee's score on the routing test.

The main advantage of such a procedure lies in matching the difficulty
level of the second test to the ability level of the examinee.

Since conventional tests are usually at a difficulty level suitable
for typical examinees in the group tested, two-stage testing procedures
are likely to be advantageous chiefly at the extremes of the ability
range.

Two-stage testing is discussed by Cronbach and Gleser (1965,
chapt. 6), using a decision theory approach. They deal primarily with
a situation where examinees are to be selected or rejected. Their
approach is chiefly sequential in the special sense that the second-stage
test is administered only to borderline examinees. All advantages of
this procedure come from varying the amount of testing according to
the ability level of the examinee.

In contrast, the present paper is concerned with situations where
the immediate purpose of the testing is measurement, not classification.

In this paper, the total number of test items administered to a single
examinee is fixed. Any advantage of two-stage testing appears as
improved measurement. Some empirical studies of such two-stage testing
are reported by Linn, Rock, and Cleary (1969), who also cite other
references.

¹This work was supported in part by contract N-00014-69-C-0017 between
the Personnel and Training Research Programs Office, Psychological Sciences
Division, Office of Naval Research and Educational Testing Service.
Reproduction in whole or in part is permitted for any purpose of the United
States Government.

The present study attempts to find, under specified restrictions,
some good designs for two-stage testing. A "good" procedure is one that
provides reasonably accurate measurement for examinees who would obtain
near-perfect or near-zero (or near-chance-level) scores on a conventional
test.

The  particulars  at  our  disposal  in  designing  a  two-stage  testing 
procedure  include  the  following: 

1.  the  number  of  items  given  to  a  single  examinee  (  n  ), 

2.  the  number  of  alternative  second-stage  tests  available  for 
use, 

3.  the  number  of  alternative  responses  per  item, 

4. the number of items in the routing test ( $n_1$ ),

5.  the  difficulty  level  of  the  routing  test, 

6.  the  method  of  scoring  the  routing  test, 

7. the cutting points for deciding which second-stage test an
examinee will take,

8. the difficulty levels of the second-stage tests,

9.  the  method  of  scoring  the  entire  two-stage  procedure. 

It  does  not  seem  feasible  to  locate  truly  "optimum"  designs.  The 
present  study  has  proceeded  by  investigating  several  designs,  modifying 
the  best  of  these  in  various  ways,  choosing  the  best  of  the  modifications, 
and  continuing  in  this  fashion  as  long  as  any  modification  can  be  found 
that  noticeably  improves  results. 


Nearly 200 different two-stage designs have been investigated in
this process. Obviously, an empirical investigation of 200 designs
would have been out of the question. Instead, theoretical investigations
were carried out, with the aid of a high-speed computer, based
on item characteristic curve theory.

Specifications 

Let us start by restricting our attention to tests composed of
dichotomously scored items. The mathematical model to be used assumes
that $P_i \equiv P_i(\theta)$, the probability of a correct response to
item $i$, is a generalized normal-ogive function of the examinee's
ability (or standing on the trait measured):

$$P_i(\theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)] , \tag{1}$$

where $\Phi(t)$ represents the normal distribution cumulative frequency up
to the relative deviate $t$. This assumes that the items to be used
are all homogeneous in the sense that they all measure the same
psychological trait.
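
As a minimal sketch, equation (1) can be evaluated with the Python standard library. The function name `icc` and its argument names are illustrative choices, not notation from the paper; `statistics.NormalDist` supplies the normal cumulative distribution function $\Phi$.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def icc(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response under equation (1).

    theta: examinee ability; a: discriminating power;
    b: item difficulty; c: lower asymptote (chance-score level).
    """
    return c + (1.0 - c) * _STD_NORMAL.cdf(a * (theta - b))


if __name__ == "__main__":
    # The curve rises from c toward 1 and passes through (1 + c) / 2 at theta = b.
    for theta in (-3.0, 0.0, 3.0):
        print(theta, round(icc(theta, a=1.0, b=0.0, c=0.2), 4))
```

At $\theta = b$ the function returns $(1 + c)/2$, which is 0.5 when there is no guessing.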

The quantities $a_i$, $b_i$, and $c_i$ are parameters describing
item $i$. The ogive $P_i(\theta)$ has its point of inflection at
$\theta = b_i$. As $\theta$ becomes negatively large, $P_i(\theta)$
approaches its lower asymptote $P_i = c_i$. For fixed $c_i$, the slope
at the point of inflection is proportional to $a_i$. Thus $a_i$ is
thought of as representing item discriminating power, $b_i$ as
representing item difficulty, and $c_i$ as a sort of practical
chance-score level. A detailed discussion of these parameters from the
present point of view is given by Lord (1969, sections 5, [illegible]).

For the sake of simplicity, let us assume that the available items
differ only in difficulty, $b_i$. They all have equal discriminating
power, denoted by a, and equal practical chance-score levels, c.
Also, let us consider the case where the routing test and each of the
second-stage tests are peaked: that is, each subtest is composed of
items all of equal difficulty.

Scoring 

For a peaked test, it is known (Birnbaum, 1968, chapter 20) that
the number-right score (number of right answers), to be denoted by x,
is a sufficient statistic for estimating an examinee's ability $\theta$.

Thus at first sight it might seem that there is no problem in scoring
a two-stage testing procedure when all subtests are peaked. However, it
is clear that different estimates of $\theta$ should be used for examinees
who obtain the same number-right score, but on different second-stage
tests having different difficulty levels.

What is needed is to find a function of the sufficient statistic
x that is an unbiased estimator, or at least a consistent estimator,
of $\theta$. The maximum likelihood estimator, to be denoted by
$\hat\theta$, satisfies these requirements, and will be used here.

For an m-item peaked subtest, the likelihood function is

$$L(x \mid \theta) = \binom{m}{x} P^{x} Q^{m-x} , \tag{2}$$

where P is the item characteristic function defined by (1) for each of
the m items, and where Q = 1 - P. Differentiate the logarithm of
the likelihood

$$\log L(x \mid \theta) = \log\binom{m}{x} + x \log P + (m - x)\log Q \tag{3}$$

with respect to $\theta$ to obtain

$$\frac{\partial \log L}{\partial \theta} = \frac{xP'}{P} - \frac{(m - x)P'}{Q} = \frac{P'}{PQ}\,(x - mP) , \tag{4}$$

where P' is the derivative of P with respect to $\theta$. When (4) is
set equal to zero, we obtain the likelihood equation

$$P(\hat\theta) = \frac{x}{m} . \tag{5}$$

Substituting (5) into (1) and solving for $\Phi$, we have

$$\Phi[a(\hat\theta - b)] = \frac{x/m - c}{1 - c} ,$$

where a, b, and c describe each item of the peaked subtest. The
maximum likelihood estimator is found by solving for $\hat\theta$:

$$\hat\theta = \frac{1}{a}\,\Phi^{-1}\!\left(\frac{x/m - c}{1 - c}\right) + b , \tag{6}$$

where $\Phi^{-1}$ is the inverse of the function $\Phi$ ( $\Phi^{-1}$ is the
relative deviate corresponding to a given normal curve area).
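
The closed form for the maximum likelihood estimate on a single peaked subtest can be sketched as follows. The function name `theta_hat` is an illustrative assumption; the sketch covers only interior scores with cm < x < m, where the estimate is finite, and `statistics.NormalDist.inv_cdf` plays the role of $\Phi^{-1}$.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def theta_hat(x: float, m: int, a: float, b: float, c: float = 0.0) -> float:
    """Maximum likelihood ability estimate from a number-right score x
    on an m-item peaked subtest with item parameters a, b, c.

    Only defined for c * m < x < m; outside that range the estimate
    is infinite and a separate convention is needed.
    """
    p = x / m
    if not (c < p < 1.0):
        raise ValueError("estimate is finite only when c*m < x < m")
    return _STD_NORMAL.inv_cdf((p - c) / (1.0 - c)) / a + b


if __name__ == "__main__":
    # Same number-right score, different subtest difficulty, different estimate.
    print(theta_hat(30, 49, a=1.0, b=-1.0))
    print(theta_hat(30, 49, a=1.0, b=+1.0))
```

The usage lines illustrate the point made under Scoring: the same number-right score maps to different ability estimates on subtests of different difficulty.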

Equation (6) gives a sufficient statistic that is also a consistent
estimator of $\theta$ having minimum variance in large samples. The separate
use of (6) for the routing test and for the second-stage test yields two
such estimates, $\hat\theta_1$ and $\hat\theta_2$, for any given examinee.
These are jointly sufficient statistics for $\theta$. They must be combined
into a single estimate. However, there is no uniquely good way to do this.

In the present study, $\hat\theta_1$ and $\hat\theta_2$ are averaged after
weighting them inversely according to their (estimated) large-sample
variances. This is the weighting that produces a consistent estimator
with minimum large-sample sampling variance. Thus an examinee's score
$\hat\theta$ on the two-stage test will be proportional to

$$\frac{\hat\theta_1}{\operatorname{Var}\hat\theta_1} + \frac{\hat\theta_2}{\operatorname{Var}\hat\theta_2} .$$


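Specifically, his overall score is defined as

$$\hat\theta = \frac{\hat\theta_1\,\hat V(\hat\theta_2) + \hat\theta_2\,\hat V(\hat\theta_1)}{\hat V(\hat\theta_1) + \hat V(\hat\theta_2)} , \tag{7}$$

where $\hat V$ is an estimate of $\operatorname{Var}\hat\theta$, so that
asymptotically

$$\hat\theta = \frac{\hat\theta_1 \operatorname{Var}\hat\theta_2 + \hat\theta_2 \operatorname{Var}\hat\theta_1}{\operatorname{Var}\hat\theta_1 + \operatorname{Var}\hat\theta_2} .$$

By a well known theorem,

$$\operatorname{Var}\hat\theta = \left\{ E\left[ \left( \frac{\partial \log L(x \mid \theta)}{\partial \theta} \right)^{2} \right] \right\}^{-1} . \tag{8}$$

From (4), then,

$$\operatorname{Var}\hat\theta = \left\{ \frac{P'^{2}}{P^{2}Q^{2}}\, E(x - mP)^{2} \right\}^{-1} .$$

By (2), x has a binomial distribution with mean mP and variance

$$E(x - mP)^{2} = mPQ ,$$

so that

$$\operatorname{Var}\hat\theta = \frac{PQ}{mP'^{2}} . \tag{9}$$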
By (1),

$$P' = (1 - c)\,a\,\varphi[a(\theta - b)] , \tag{10}$$

where $\varphi(t)$ is the normal curve ordinate at the relative deviate $t$.

In practice, $\hat V(\hat\theta_1)$ and $\hat V(\hat\theta_2)$ were obtained
by substituting $\hat\theta_1$ or $\hat\theta_2$, respectively, for $\theta$
in the right-hand sides of (9) and (10).

When x = m or x = cm, the $\hat\theta$ defined by (6) would be infinite.
To avoid this, whenever x = m, x was in practice replaced by
x = m - 1/2. Whenever x < cm and x + 1 > cm, the lower of these
two scores was replaced by (x + 1 + cm)/2. At the same time, all other
scores lower than (x + 1 + cm)/2 were also replaced by (x + 1 + cm)/2.

The score $\hat\theta$ so constructed from (7) will not have strictly optimum
properties for small n; however, this is typical of estimation problems
where (as here) no single sufficient estimator exists. Two-stage testing
is on its face a rather inefficient method of tailored testing. Any
additional inefficiency from the use of $\hat\theta$ should be of relatively
minor importance.
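
The scoring rules of this section (the end-point replacement, the subtest estimate and its estimated large-sample variance, and the inverse-variance weighted average) can be put together in a short sketch. All function names are illustrative. One point is an assumption of this sketch rather than something stated in the text: the low-score rule is written in the source for x < cm < x + 1, and here it is applied in the same way when x equals cm exactly (for example x = 0 with c = 0).

```python
import math
from statistics import NormalDist

_N = NormalDist()


def adjust_score(x: int, m: int, c: float) -> float:
    """End-point rule that keeps the subtest estimate finite."""
    if x >= m:
        return m - 0.5
    x0 = math.floor(c * m)              # largest score not exceeding c*m
    lowest = (x0 + 1 + c * m) / 2       # replacement for every score <= x0
    return max(float(x), lowest)


def subtest_estimate(x: int, m: int, a: float, b: float, c: float):
    """Return (ability estimate, estimated variance) for one peaked subtest."""
    p = adjust_score(x, m, c) / m
    z = _N.inv_cdf((p - c) / (1 - c))   # z = a * (theta_hat - b)
    theta = z / a + b
    p_prime = (1 - c) * a * _N.pdf(z)   # slope of P at theta_hat
    variance = p * (1 - p) / (m * p_prime ** 2)
    return theta, variance


def two_stage_score(x, y, n1, n2, a, b_routing, b_second, c):
    """Inverse-variance weighted average of the two subtest estimates."""
    t1, v1 = subtest_estimate(x, n1, a, b_routing, c)
    t2, v2 = subtest_estimate(y, n2, a, b_second, c)
    return (t1 * v2 + t2 * v1) / (v1 + v2)


if __name__ == "__main__":
    # 11-item routing test at b = 0, 49-item second stage at b = +0.5, no guessing.
    print(two_stage_score(x=8, y=30, n1=11, n2=49,
                          a=1.0, b_routing=0.0, b_second=0.5, c=0.0))
```

Because the weights are positive and sum to one, the combined score always lies between the two subtest estimates, closer to the one with the smaller estimated variance.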

Evaluation  of  Procedures 

If there are $n_1$ items in the routing test and $n_2$ items in the
second-stage test, there are at most $(n_1 + 1)(n_2 + 1)$ different
possible numerical values for $\hat\theta$. Let $\hat\theta_{xy}$ denote the
value of $\hat\theta$ when the number-right scores on the routing test and
on the second-stage test are, respectively, x and y. By (2), the
frequency distribution of $\hat\theta$ is

$$\operatorname{Prob}(\hat\theta = \hat\theta_{xy} \mid \theta) = \binom{n_1}{x} P_1^{x} Q_1^{n_1 - x} \binom{n_2}{y} P_2^{y} Q_2^{n_2 - y} , \tag{11}$$

where $P_1$ is given by (1) with $a_i = a$, $c_i = c$, and $b_i$ equal to the
difficulty level ( b , say) of the routing test; and where $P_2$ is
similarly given by (1) with $b_i = b(x)$, a numerical function of x
assigned in advance by the psychometrician.
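
The joint distribution of the routing score x and the second-stage score y described above is a product of two binomial probabilities, the second depending on x through b(x). A minimal sketch follows; the function names and the representation of b(x) as a list `b_of_x` of length $n_1 + 1$ are illustrative assumptions.

```python
from math import comb
from statistics import NormalDist

_N = NormalDist()


def icc(theta, a, b, c):
    """Normal-ogive item characteristic function."""
    return c + (1 - c) * _N.cdf(a * (theta - b))


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def joint_distribution(theta, n1, n2, a, b, c, b_of_x):
    """table[x][y] = Prob(routing score x, second-stage score y | theta).

    b_of_x[x] is the difficulty of the second-stage test assigned
    to an examinee whose routing score is x.
    """
    p1 = icc(theta, a, b, c)
    table = []
    for x in range(n1 + 1):
        p2 = icc(theta, a, b_of_x[x], c)
        px = binom_pmf(x, n1, p1)
        table.append([px * binom_pmf(y, n2, p2) for y in range(n2 + 1)])
    return table


if __name__ == "__main__":
    design = [-1.0] * 3 + [-0.5] * 3 + [0.5] * 3 + [1.0] * 3   # 11-item routing test
    table = joint_distribution(0.0, 11, 49, 1.0, 0.0, 0.0, design)
    print(sum(sum(row) for row in table))   # probabilities sum to 1 (up to rounding)
```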

Given numerical values for $n_1$, $n_2$, a, b, c, and for b(x),
x = 0, 1, ..., $n_1$, the exact frequency distribution of the examinee's
score $\hat\theta$ for examinees at any given ability level $\theta$ can be
computed from (11). These frequency distributions contain all possible
information relevant for choosing among specified two-stage testing
procedures.

In actual practice, it is necessary to summarize somehow the
plethora of numbers computed from (11). This has been done here by
using the information function $I_x(\theta)$ discussed at some length in
Lord (1969, 1971). By definition,

$$I_x(\theta) = \frac{\left[ \dfrac{\partial}{\partial\theta} E(x \mid \theta) \right]^{2}}{\operatorname{Var}(x \mid \theta)} , \tag{12}$$

where x represents whatever test score is used. For two-stage testing
with test score $\hat\theta$, the symbol x is replaced by $\hat\theta$.

For a given $\theta$, the denominator of (12) is computed in straightforward
fashion from the conditional frequency distribution (11). Denoting the
probability in (11) by $p_{xy}$, we have

$$E(\hat\theta \mid \theta) = \sum_{x=0}^{n_1} \sum_{y=0}^{n_2} p_{xy}\,\hat\theta_{xy} .$$

Since $\hat\theta_{xy}$ is not a function of $\theta$,

$$\frac{\partial}{\partial\theta} E(\hat\theta \mid \theta) = \sum_{x=0}^{n_1} \sum_{y=0}^{n_2} \hat\theta_{xy}\,\frac{\partial p_{xy}}{\partial\theta} .$$

A formula for $\partial p_{xy}/\partial\theta$ is easily written down from (11),
from which numerical values of the numerator of (12) are then calculated
for given $\theta$. In this way, $I_{\hat\theta}(\theta)$ is evaluated
numerically for all ability levels of interest.
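
The whole evaluation can be sketched end to end: build the table of possible scores, compute the mean and variance of the score at a given ability from the joint distribution, and take the squared slope of the mean over the variance. Two things here are assumptions of the sketch and not the paper's method: the slope is obtained by a central finite difference instead of an analytic derivative, and the low-score replacement rule is applied when x equals cm exactly. All function names are illustrative.

```python
import math
from math import comb
from statistics import NormalDist

_N = NormalDist()


def icc(theta, a, b, c):
    return c + (1 - c) * _N.cdf(a * (theta - b))


def pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def estimate(x, m, a, b, c):
    """(ability estimate, estimated variance) for one peaked subtest,
    with the end-point replacement rule applied to the raw score."""
    if x >= m:
        x_adj = m - 0.5
    else:
        x0 = math.floor(c * m)
        x_adj = max(float(x), (x0 + 1 + c * m) / 2)
    p = x_adj / m
    z = _N.inv_cdf((p - c) / (1 - c))
    p_prime = (1 - c) * a * _N.pdf(z)
    return z / a + b, p * (1 - p) / (m * p_prime ** 2)


def score_table(n1, n2, a, b, c, b_of_x):
    """Combined score for every pair (x, y), weighted inversely by variance."""
    table = []
    for x in range(n1 + 1):
        t1, v1 = estimate(x, n1, a, b, c)
        row = []
        for y in range(n2 + 1):
            t2, v2 = estimate(y, n2, a, b_of_x[x], c)
            row.append((t1 * v2 + t2 * v1) / (v1 + v2))
        table.append(row)
    return table


def moments(theta, scores, n1, n2, a, b, c, b_of_x):
    """Mean and variance of the combined score at ability theta."""
    p1 = icc(theta, a, b, c)
    mean = second = 0.0
    for x in range(n1 + 1):
        px = pmf(x, n1, p1)
        p2 = icc(theta, a, b_of_x[x], c)
        for y in range(n2 + 1):
            pxy = px * pmf(y, n2, p2)
            mean += pxy * scores[x][y]
            second += pxy * scores[x][y] ** 2
    return mean, second - mean ** 2


def information(theta, n1, n2, a, b, c, b_of_x, h=1e-4):
    """Squared slope of the mean score divided by the score variance."""
    scores = score_table(n1, n2, a, b, c, b_of_x)
    args = (scores, n1, n2, a, b, c, b_of_x)
    _, var = moments(theta, *args)
    slope = (moments(theta + h, *args)[0] - moments(theta - h, *args)[0]) / (2 * h)
    return slope ** 2 / var


if __name__ == "__main__":
    design = [-1.0] * 3 + [-0.5] * 3 + [0.5] * 3 + [1.0] * 3
    for theta in (-1.5, -1.0, -0.5, 0.0):
        print(theta, information(theta, 11, 49, 1.0, 0.0, 0.0, design))
```

The example design is an 11-item routing test followed by one of four 49-item second-stage tests, with a = 1 and b = 0 so that the printed values are directly in the scaled units used for the figures.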

The information function $I_{\hat\theta}(\theta)$ is (approximately) an
index of how effective the testing and scoring procedures are for
measuring the examinee. For a conventional type of test, the value of
$I_x(\theta)$ is directly proportional to the number of test items. The
numerical value of $I_{\hat\theta}(\theta)$ for a single testing procedure
ordinarily is not interpreted by itself, but only in comparison to the
value of $I_{\hat\theta}(\theta)$ for some other procedure. Thus, if
$I_{\hat\theta}(\theta)$ for one procedure is r times as large as
$I_{\hat\theta}(\theta)$ for a second procedure, this is to be interpreted
as representing an improvement in measurement effectiveness equivalent
to that obtained by lengthening a conventional test r times.


Explanation  of  Figures 


Figure 1 shows the information functions for five different testing
procedures. The two solid curves are benchmarks, with which curves for
three two-stage procedures are compared.

The "standard" curve shows the information function for the
number-right score on a 60-item peaked test of the conventional type. The
items all have the same difficulty level, b, and the same
discriminating power, a.


The vertical scale represents amount of information obtained, as
a function of ability level, $\theta$, the latter being shown along the
horizontal scale. Instead of drawing a different information curve
for each pair of values, a and b, that is of interest, it very
conveniently turns out to be possible to choose the units of measurement
for the horizontal and vertical scales so that a single information
curve will be valid for any a and for any b. This has been done
for the figures shown here, which explains why the scale values shown
along the horizontal and vertical scales are functions of a and b.
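
For the number-right score on an m-item peaked test, the mean is mP and the variance is mPQ, so the ratio of squared slope to variance reduces to $mP'^2/(PQ)$. The sketch below evaluates this "standard" curve in units of $a^2$ as a function of the scaled ability $u = a(\theta - b)$, which shows why one curve serves for any a and b. The function name and the variable u are illustrative assumptions.

```python
from statistics import NormalDist

_N = NormalDist()


def standard_information(u: float, m: int = 60, c: float = 0.0) -> float:
    """Information of the number-right score on an m-item peaked test,
    divided by a**2, at scaled ability u = a * (theta - b)."""
    p = c + (1 - c) * _N.cdf(u)
    dp_du = (1 - c) * _N.pdf(u)        # dP/dtheta equals a * dP/du
    return m * dp_du ** 2 / (p * (1 - p))


if __name__ == "__main__":
    for u in (-1.5, -1.0, -0.5, 0.0):
        print(u, round(standard_information(u), 2))
```

With c = 0 the curve peaks at u = 0, where its value is $120/\pi \approx 38.2$, and it falls to about 16.1 at u = -1.5.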

Only information curves symmetrical about $\theta = b$ were investigated
when c = 0. For this reason, only the left portion of each curve is
shown in Figure 1.

Although we are directly concerned here with testing single individuals
(there may be just one examinee, not a group), the reader needs
to know what range of $\theta$ is of concern to him. If a 60-item peaked
test with c = 0, b = 0, and a = 1.00 is administered to a group
in which $\theta$ is normally distributed with $\mu_\theta = 0$ and
$\sigma_\theta = 0.5$, the test reliability will be 0.90 (see Lord, 1969,
section 4). If a reliability of .90 is roughly what the reader would
expect for 60-item tests and examinee groups that he is concerned with,
and if his groups have roughly a normal distribution of ability, then
roughly two-thirds of his examinees should fall between $\theta = -0.5$
and $+0.5$, that is (since b = 0 and a = 1), between
$\theta = b - 0.5/a$ and $\theta = b + 0.5/a$.

If the reader is interested only in this subrange of ability, he will
not find it profitable to use two-stage testing of the kinds considered
here. It is assumed, therefore, that he may be interested in the
range from $\theta = b - 1.5/a$ to $\theta = b + 1.5/a$, or perhaps from
$\theta = b - 1/a$ to $\theta = b + 1/a$.

Suppose, next, that the reader is concerned about a testing situation
where c = 0, b = 0, a = .50 and $\mu_\theta = 0$, $\sigma_\theta = 1.0$.
The test items here are only half as discriminating as those considered
before, but the group tested is twice as heterogeneous. These changes
offset each other, so that a 60-item peaked test will again have a
reliability of 0.90. The reader will still look at the same segments of
the information curves as before, since the range from
$\theta = b - 1.5/a$ to $\theta = b + 1.5/a$, for example, will still
(since a = .50) cover 6 standard deviation units of ability for the
group tested.

Suppose, next, that the reader is concerned with an unusual testing
situation where a 60-item peaked test typically has a reliability of
.80. This may occur either because his items have low values of a or
because his examinees are rather homogeneous. A reliability of .80 will
be found if a = .33 and $\sigma_\theta = 1$; or if, alternatively, a = 1
and $\sigma_\theta = .33$. In either case, the range from
$\theta = b - 1/a$ to $\theta = b + 1/a$ covers 6 standard deviation units
of ability in the group tested. In this case the reader will probably
wish to ignore the left third of Figure 1 as representing extreme
ability levels so rare that they can be neglected.

Suppose, finally, that the reader is concerned with an unusual
situation where a 60-item peaked test typically has a reliability of
.97. This would occur if a = 1.00 and $\sigma_\theta = 1$, or if a = .50
and $\sigma_\theta = 2$. In this case, the range from $\theta = b - 1.5/a$
to $\theta = b + 1.5/a$ covers only 3 standard deviations, representing
the middle 87 percent of the group tested.
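
The percentages quoted in these examples follow from the normal distribution and can be checked with a few lines. This is only a sketch for the normally distributed case the examples assume; the function name `central_coverage` is an illustrative choice.

```python
from statistics import NormalDist


def central_coverage(half_width_in_sd: float) -> float:
    """Proportion of a normal group lying within the given number of
    standard deviations on either side of the mean."""
    z = NormalDist()
    return z.cdf(half_width_in_sd) - z.cdf(-half_width_in_sd)


if __name__ == "__main__":
    # 1 SD each side: roughly two-thirds; 1.5 SD each side: about 87 percent;
    # 3 SD each side (a 6 SD range): nearly everyone.
    for k in (1.0, 1.5, 3.0):
        print(k, round(central_coverage(k), 4))
```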

There is no assumption of a normal or other frequency distribution
underlying the figures. The point is simply that the reader needs to
know what range of $\theta$ is of interest to him. If his examinees are
asymmetrically distributed, or if he is chiefly interested in only
part of the ability range of the group tested, then he will pick the
portion of Figure 1 that interests him accordingly.

In choosing among two-stage testing procedures, a procedure can
be eliminated if computations show that its information curve is always
lower than the curve of some other procedure, regardless of $\theta$ level.
Commonly, however, information curves cross, showing that one procedure
provides better measurement at certain ability levels, whereas another
procedure is better at other levels.

As already pointed out, an examiner who wants accurate measurement
for typical examinees in the group tested and is less concerned about
accurate measurement at the extremes should use a peaked conventional
test. If a two-stage procedure is to be really valuable, it will
usually be because it provides good measurement for extreme as well as
for typical examinees. For this reason, the main effort in the present
study has been to find two-stage procedures with information curves
similar to (or better than) "up-and-down" curves shown in the figures.
These last are benchmark curves, chosen as the "best" of those obtained
by the up-and-down method of tailored testing (see Lord, 1969). The
up-and-down curve shown here in Figure 1 is the curve labeled
"ad = .20, c = 0" shown there in Figure 7.6.

Results for 60-Item Tests with No Guessing

Surprisingly, Figure 1 shows that when there is no guessing it is
possible to approximate the measurement efficiency of a 60-item
up-and-down tailored testing procedure by a 60-item two-stage procedure
throughout the ability range from $\theta = b - 1.5/a$ to
$\theta = b + 1.5/a$.

The effectiveness of the two-stage procedures shown falls off rather
sharply outside this ability range, but this range is adequate or more
than adequate for most testing purposes (as explained in the preceding
section).

The label "11; ±1, ±.5" indicates that the routing test contains
$n_1 = 11$ items (at difficulty b), and that there are four alternative
49-item second-stage tests with difficulty levels b - 1/a, b - .5/a,
b + .5/a, and b + 1/a. The cutting points on this routing test are
equally spaced in terms of number-right scores, $x_1$: if $x_1$ = 0-2,
the examinee is routed to the easiest second-stage test; if $x_1$ = 3-5,
to the next easiest; and so on.
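
Routing with equally spaced cutting points amounts to integer division of the routing score by the group size. A minimal sketch follows; the function name `route` and the representation of the design as a list of offsets $a(b_2 - b)$, ordered from easiest to hardest, are illustrative assumptions.

```python
from typing import Sequence


def route(x1: int, n1: int, offsets: Sequence[float]) -> float:
    """Return the second-stage offset a*(b2 - b) for routing score x1
    when the n1 + 1 possible scores are split into equal-sized groups."""
    group_size, remainder = divmod(n1 + 1, len(offsets))
    if remainder:
        raise ValueError("scores cannot be split into equal groups")
    if not 0 <= x1 <= n1:
        raise ValueError("routing score out of range")
    return offsets[x1 // group_size]


if __name__ == "__main__":
    design = [-1.0, -0.5, 0.5, 1.0]        # the "11; ±1, ±.5" procedure
    for x1 in range(12):
        print(x1, route(x1, 11, design))
```

For the 11-item routing test this reproduces the groups 0-2, 3-5, 6-8, and 9-11.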

The label "7; ±1.125, ±.3125" is similarly interpreted, the examinees
being routed according to the score groupings $x_1$ = 0-1, $x_1$ = 2-3,
$x_1$ = 4-5, $x_1$ = 6-7. The label "11; ±1.25, ±.75, ±0.25" similarly
indicates a procedure with six alternative second-stage procedures,
assigned according to the groupings $x_1$ = 0-1, $x_1$ = 2-3, ...,
$x_1$ = 10-11.

A 60-item up-and-down procedure in principle requires 1,830 items
before testing can start. In practice, 600 items might be adequate
without seriously impairing measurement. Two of the two-stage procedures
shown in Figure 1 require slightly more than 200 items.
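
The item-pool sizes can be checked with simple arithmetic. A two-stage design needs the routing items plus one full second-stage test per alternative. For the up-and-down procedure, the figure of 1,830 coincides with 60 × 61 / 2, which is what results if the j-th item administered may have to come from any of j difficulty levels; that reading is an assumption of this sketch, not something the text states. Function names are illustrative.

```python
def up_and_down_pool(n: int) -> int:
    """Pool size if the j-th item may come from any of j difficulty levels
    (an assumed reading that reproduces the figure quoted in the text)."""
    return n * (n + 1) // 2


def two_stage_pool(n: int, n1: int, k: int) -> int:
    """Routing items plus k alternative second-stage tests of n - n1 items."""
    return n1 + k * (n - n1)


if __name__ == "__main__":
    print(up_and_down_pool(60))        # 1830
    print(two_stage_pool(60, 11, 4))   # 207  ("11; ±1, ±.5")
    print(two_stage_pool(60, 7, 4))    # 219  ("7; ±1.125, ±.3125")
    print(two_stage_pool(60, 11, 6))   # 305  ("11; ±1.25, ±.75, ±0.25")
```

The two four-test designs come to 207 and 219 items, in line with the statement that two of the procedures need slightly more than 200 items.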

The two-stage procedures shown in Figure 1 are the "best," also
the last ones tried, out of sixty-odd 60-item procedures studied with
c = 0. None of the two-stage procedures that at first seemed promising
according to armchair estimates turned out particularly well. From
this experience, it seems that casually designed two-stage tests are
likely to provide fully effective measurement only over a relatively
narrow range of ability, or possibly not at all.

                              Table 1

    Information for Various 60-Item Testing Procedures with c = 0

                                        Information** at
Procedure*                 θ = b-1.5/a    b-1/a    b-0.5/a      b

Up-and-down (benchmark)        33.5        34.3      34.9      35.1
 7; ±1.125, ±.3125             32.5        34.4      34.5      35.1
 7; ±1, ±.25                   31.1        34.2      35.1      35.8
 7; ±1, ±.25*                  27.0        31.4      35.8      37.0
 7; ±1.25, ±.25                33.2        33.7      33.7      35.1
 7; ±.75, ±.25                 28.0        33.7      35.9      36.5
11; ±1, ±.25                   30.4        34.1      35.5      36.8
11; ±1, ±.5                    30.6        34.8      35.6      34.9
11; ±1.25, ±.375               32.6        34.0      34.6      35.5
 3; ±.75, ±.25                 27.6        32.9      34.9      35.2
 3; ±.75, ±.3                  28.0        33.8      34.0      33.4
 7; ±.75                       26.6        34.4      34.5      31.4
 7; ±.5                        24.4        32.9      36.0      34.9
 3; ±.5                        24.5        32.5      34.5      34.4

*All cutting points are equally spaced, except for the starred procedure,
which has score groups x1 = 0, x1 = 1-3, x1 = 4-6, x1 = 7.

**All information values are to be multiplied by a².

Discussion of Results for 60-Item Tests with No Guessing

Table 1 shows the information at four different ability levels
obtainable from some of the better procedures. The following
generalizations are tentative and may not hold in situations quite
different from those studied here.

Length of routing test. If the routing test is too long, not
enough items are left for the second-stage test, so that measurement
may be effective near $\theta = b$, but not at other ability levels. If
the routing test is too short, then examinees are poorly allocated to
the second-stage tests. In this case, if the second-stage tests all
have difficulty levels near b, then effective measurement may be
achieved near $\theta = b$ but not at other ability levels; if the
second-stage tests differ considerably in difficulty level, then the
misallocation of examinees may lead to relatively poor measurement at all
ability levels. The results shown in Table 1 and Figure 2 suggest that
$n_1 = 3$ is too small and $n_1 = 11$ is too large for the range
b ± 1.5/a in the situation considered, assuming that no more than four
second-stage tests are used.

Number of second-stage tests. There cannot usefully be more than
$n_1 + 1$ second-stage tests (one for each possible routing score). The
number of such tests will also often be limited by considerations of
economy. If there are only two second-stage tests, good measurement may
be obtained in the subranges of ability best covered by these tests, but
not elsewhere (see "7; ±.75" in Table 1). On the other hand, a short
routing test cannot make sufficiently accurate allocations to justify a
large number of second-stage tests. In the present study, the number of
second-stage tests was kept as low as possible; however, at least four
second-stage tests were required to achieve effective measurement over
the ability range considered.

Difficulty of second-stage tests. If the difficulty levels of the
second-stage tests are all too close to b, there will be poor
measurement at extreme ability levels (see "7; ±.75, ±.25" in Table 1). If
the difficulty levels are too extreme, there will be poor measurement
near $\theta = b$.

Cutting points on routing test. It is clearly important that the
difficulty levels of the second-stage tests should match the ability
levels of the examinees allocated to them, as determined by the cutting
points used on the routing test. It is difficult to find an optimal
match by the trial-and-error methods used here. Although many computer
runs were made using unequally spaced cutting points, like those
indicated in the footnote to Table 1, equally spaced cutting points turned
out better. This matter deserves more careful study.

Results for 15-Item Tests with No Guessing

Some 40-odd different procedures were tried out for the case where
a total of n = 15 items with c = 0 are to be administered to each
examinee. The "best" of these, those with information curves near the
up-and-down benchmark, are shown in Figure 2. The benchmark here is
again one of the "best" up-and-down procedures (see Stocking, 1969,
Fig. 2, curve labeled "A" and "ad = .50").

                              Table 2

    Information for Various 15-Item Testing Procedures with c = 0

                                        Information** at
Procedure*                 θ = b-1.5/a    b-1/a    b-0.5/a      b

Up-and-down (benchmark)         7.6         7.9       8.1       8.2
3; ±1.25, ±.5                   7.6         7.8       8.0       8.2
3; ±1.25, ±.25                  7.4         7.8       8.0       8.5
3; ±1, ±.25                     7.0         8.0       8.4       8.7
7; ±1.25, ±.5                   6.5         7.6       8.4       8.5
5; ±1.5, ±1, ±.5                7.2         7.7       8.0       8.1
4; ±1, 0*                       7.1         8.0       8.0       8.0
2; ±1, 0                        7.2         8.0       8.0       7.9
[illegible]; ±.25               4.8         7.1       8.7       9.1
7; ±1                           6.2         7.8       8.0       7.5

*All cutting points are equally spaced, except for the starred procedure,
which has score groups x1 = 0-1, x1 = 2, x1 = 3-4.

**All information values are to be multiplied by a².

Table 2 shows results for various other two-stage procedures not
quite as "good" as those in Figure 2. In general, these others either
did not measure well enough at extreme ability levels, or else did not
measure well enough at $\theta = b$. The results for n = 15 seem to
require no further comment, since the general principles are the same
as for n = 60.


Results  for  60-Item  Tests  with  Guessing 

About 75 different 60-item two-stage procedures with c = .20 were
tried out. The "best" of these are shown in Figure 3 along with an
appropriate benchmark procedure (see Lord, 1969, Fig. 7.8, curve labeled
"ad = .25, H = 1, L = 2").

Apparently, when items can be answered correctly by guessing,
two-stage testing procedures are not as effective for measuring at extreme
ability levels as are the better up-and-down procedures. Unless some
really "good" two-stage procedures were missed in the present
investigation, it appears that a two-stage test might require ten or more
alternative second stages in order to measure well throughout the
range shown in Figure 3. Such tests were not studied here because the
cost of producing so many second stages may be excessive. Very possibly,
a three-stage procedure would be preferable.


[Fig. 3. Two-Stage Procedure, n=60, c=. Curve label: Standard.]

When there is guessing, maximum information is likely to be obtained
at an ability level higher than $\theta = b$, as is apparent from
Figure 3. This means that the examiner will probably wish to choose
a value of b (the difficulty level of the routing test) somewhat
below the mean ability level of the group to be tested. If a value
of b were chosen near $\mu_\theta$, the mean ability level of the group, as
might well be done if there were no guessing, then the two-stage
procedures shown in Figure 3 would provide good measurement for the top
examinees (above $\theta = b + 1/a$) but quite poor measurement for the
bottom examinees (below $\theta = b - 1/a$). If an examiner wants good
measurement over two or three standard deviations on each side of the
mean ability level of the group, he should choose the value of b for
the two-stage procedures in Figure 3 so that $\mu_\theta$ falls near
b + .75/a. In this way, the ability levels of his examinees might be
covered by the range from $\theta = b - .75/a$ to $\theta = b + 2.25/a$,
for example.

The three two-stage tests shown in Figure 3 are as follows. Test
68 has an 11-item routing test with six score groups $x_1$ = 0-3, 4, 5-6,
7-8, 9-10, 11, corresponding to six alternative second-stage tests at
difficulty levels $b_2$ where $a(b_2 - b)$ = -1.35, -.65, -.325, +.25,
+.75, and +1.5. Test 69 has a 17-item routing test with
$x_1$ = 0-5, 6-7, 8-10, 11-13, 14-15, 16-17 and $a(b_2 - b)$ = -1.5, -.75,
-.25, +.35, +.9, +1.5. Test 65e has an 11-item routing test with
$x_1$ = 0-2, 3-4, 5-6, 7-8, 9-10, 11 and $a(b_2 - b)$ = -1.5, -.9, -.3,
+.2, +.6, +1.0.
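
When the score groups are unequal, as in these three tests, routing is a lookup against the upper score of each group. The sketch below encodes Test 68, reading its six score groups as 0-3, 4, 5-6, 7-8, 9-10, and 11; the constant and function names are illustrative assumptions.

```python
from bisect import bisect_left

# Test 68: highest routing score in each of the six groups, and the
# corresponding second-stage offsets a * (b2 - b).
TEST_68_UPPER_SCORES = [3, 4, 6, 8, 10, 11]
TEST_68_OFFSETS = [-1.35, -0.65, -0.325, 0.25, 0.75, 1.5]


def second_stage_difficulty(x1, a, b, upper_scores, offsets):
    """Difficulty b2 of the second-stage test assigned to routing score x1."""
    if not 0 <= x1 <= upper_scores[-1]:
        raise ValueError("routing score out of range")
    group = bisect_left(upper_scores, x1)   # first group whose top score >= x1
    return b + offsets[group] / a


if __name__ == "__main__":
    for x1 in range(12):
        print(x1, second_stage_difficulty(x1, 1.0, 0.0,
                                          TEST_68_UPPER_SCORES, TEST_68_OFFSETS))
```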

A table of numerical values would be bulky and will not be given
here. Most of the conclusions apparent from such a table have already
been stated. No investigations have been carried out for shorter
(n < 60) two-stage procedures with c = .2.

Summary 

Various  two-stage  testing  procedures  were  studied,  using  a 
mathematical  model  provided  by  mental  test  theory.  When  the  test  items 
cannot  be  answered  correctly  by  guessing,  certain  two-stage  procedures 
were  found  to  be  about  as  effective  over  the  ability  range  of  interest 
as  were  the  "best"  of  the  up-and-down  tailored  testing  procedures 
studied  previously  (Lord,  1969).  When  low-ability  examinees  are  able 
to  answer  all  items  correctly  at  least  20  percent  of  the  time,  however, 
no two-stage procedure was found that matched the effectiveness of the
"best" up-and-down tailored procedures over this ability range.

This writer's feet-on-the-desk designs for two-stage procedures were
found to produce comparatively poor results. Careful preliminary
investigations may be required in order to obtain effective measurement
over a wide range of ability.